Exercise project (TKO_3120 Machine Learning and Pattern Recognition, 5 ECTS)

The goal of this project was to find the best performing supervised machine learning model with optimized hyperparameters to efficiently predict the class of a rice grain based on its extracted characteristics.

The data consisted of images of Arborio, Basmati, Ipsala, Jasmine and Karacadag rice grains against a black background, 500 samples in total. The project follows the findings of a previous study by Çınar and Koklu entitled Identification of Rice Varieties Using Machine Learning Algorithms, and the samples used were randomly picked from a collection of 75,000 images using a fixed random seed (50).

For preprocessing, the data was standardized, and dimensionality reduction (PCA) was applied to properly visualize the clusters. Missing values were handled accordingly, and morphological, color and shape features were extracted from the images. Random Forest, Support Vector Machine and MLP were chosen as the model candidates, and the final evaluation was performed as a nested cross-validation, which revealed that the model with the best accuracy is the Support Vector Machine.

Original research article:

İ. Çınar and M. Koklu. Identification of rice varieties using machine learning algorithms. Journal of Agricultural Sciences, 28(2):307–325, 2022. doi: 10.15832/ankutbd.862482.

https://dergipark.org.tr/en/download/article-file/1513632

Preparation of the data¶

In [1]:
# Importing necessary libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

# Importing libraries for file manipulation and image processing
import glob, os
import cv2 as cv

# Importing libraries for statistical analysis
from scipy.stats import kurtosis, skew

# Importing libraries for machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn import svm
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

# Importing libraries for miscellaneous functionalities
from random import sample, seed
from IPython.display import display, HTML
import warnings
import time

The rice sample images were downloaded from http://www.muratkoklu.com/datasets/vtdhnd09.php. These were separated into folders by rice species: 'Arborio', 'Basmati', 'Ipsala', 'Jasmine', and 'Karacadag'. Then, 100 random images were sampled from each species, totaling 500 images. The seed(50) call was used for reproducibility.

In [6]:
# import data
# set the seed for enabling the reproduction with the same sequence
seed(50)

path = '../data'
folders = ['Arborio', 'Basmati', 'Ipsala', 'Jasmine', 'Karacadag']

all_images = []
subset = []
for folder in folders:
    path_folder = os.path.join(path, folder)
    # all .jpg files from the given folder
    files = glob.glob(os.path.join(path_folder, '*.jpg'))
    # gather all sampled filenames in subset list
    subset.extend(sample(files, 100))
In [7]:
# Gathering the sampled images into a list
image_list =[]
# Iterating through the subset list and reading the images
for image in subset:
    image_list.append(cv.imread(image))

basmati (244).jpg is used as a test image

In [8]:
# Saving the test image
test_image = cv.imread(os.path.join(path, 'Basmati', 'basmati (244).jpg'))

The contour of each rice grain was determined using the findContours function from OpenCV. The contour of the test image was also identified. To visualize the results, both the original test image and the image with its contour drawn were plotted.

To avoid modifying the original image when using drawContours, a copy of the test image was passed to the function.

In [9]:
# Function for finding contours
def find_contours (image):
    # Grayscale conversion
    img_gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
    # Binary thresholding (0 is black, 255 is white; values greater than 127 become white)
    ret, thresh = cv.threshold(img_gray, 127, 255, 0)
    # Identify contours. RETR_TREE retrieves all the contours and creates a tree hierarchy
    # (suitable for grains against a black background); CHAIN_APPROX_SIMPLE compresses
    # contour segments and stores only their end points
    contours, hierarchy = cv.findContours(image=thresh, mode=cv.RETR_TREE, method=cv.CHAIN_APPROX_SIMPLE)
    return contours
In [10]:
# Copy of the test image to avoid overwrite
test_copy = test_image.copy()

# Drawing of the contours to the copy, ContourIdx=-1 to draw all contour lines and LINE_AA for an anti-aliased line
cv.drawContours(image=test_copy, contours=find_contours(test_copy), contourIdx=-1, color=(0,255,0), thickness=1, lineType=cv.LINE_AA)

# Display the original image and the contours
plt.subplot(1, 2, 1)
plt.imshow(cv.cvtColor(test_image, cv.COLOR_BGR2RGB))
plt.title('Original Image')
plt.axis('off')

# Contours Image
plt.subplot(1, 2, 2)
plt.imshow(cv.cvtColor(test_copy, cv.COLOR_BGR2RGB))
plt.title('Contours')
plt.axis('off')
Out[10]:
(-0.5, 249.5, 249.5, -0.5)
In [11]:
# List containing contour values for each image
contours_list = []
for image in image_list:
    contours_list.append(find_contours(image))

In the first part, the images are loaded and gathered into a list (image_list).
The find_contours function is then used to gather the contours of every image into contours_list.
The test image (Basmati (244)) is also processed with the function, and a copy is used to display the original image next to its contours.
Main source: https://learnopencv.com/contour-detection-using-opencv-python-c/#Drawing-Contours-using-CHAIN_APPROX_NONE

Feature extraction¶

Color feature extraction. RGB images are converted to the YCrCb format

In [12]:
ycrcb_list = []
# Converting the images to YCrCb color space and appending them to the list
for image in image_list:
    ycrcb_list.append(cv.cvtColor(image, cv.COLOR_BGR2YCrCb))

The images are stored in YCrCb format in a list (OpenCV's defined channel ordering for the conversion). Each pixel takes the form [1, 2, 3] where:

1 = Y (luma), the brightness of the pixel
2 = Cr (chrominance red), the red difference
3 = Cb (chrominance blue), the blue difference
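For reference, OpenCV's 8-bit BGR-to-YCrCb conversion follows the formulas from its color-conversion documentation; a minimal arithmetic sketch:

```python
def bgr_to_ycrcb(b, g, r):
    # 8-bit conversion as documented by OpenCV (offset delta = 128 for uint8 images)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128
    cb = (b - y) * 0.564 + 128
    return y, cr, cb

# A bluish pixel (B=200, G=150, R=100): Cr falls below and Cb above the 128 midpoint
y, cr, cb = bgr_to_ycrcb(200, 150, 100)
```

This makes the later statistics easier to read: Cr and Cb hover around 128 for neutral grays, and their variance reflects how much the grain's color deviates from neutral.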

In [13]:
def pixels_in_contour(image, contour):
    # Initialize lists to store Y, Cb, Cr values within contour
    y_values = []
    cr_values = []
    cb_values = []
    # Iterate over each pixel in the image
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            # pointPolygonTest returns +1 inside, 0 on the edge, -1 outside;
            # == 1 keeps only pixels strictly inside the contour
            if cv.pointPolygonTest(contour, (x, y), measureDist=False) == 1:
                # Get YCrCb values at the pixel location (NumPy indexing is [row, col] = [y, x])
                y_value, cr_value, cb_value = image[y, x]
                # Append Y, Cb, Cr values to lists
                y_values.append(y_value)
                cr_values.append(cr_value)
                cb_values.append(cb_value)
    return y_values, cr_values, cb_values
In [14]:
all_ycrcb_values = {}
 # Iterate over each image and its contours and add Y, Cb, Cr values to the dictionary
for indx, image in enumerate(ycrcb_list):
    y_values, cr_values, cb_values = pixels_in_contour(image, contours_list[indx][0])
    all_ycrcb_values[indx] = {'y_values': y_values, 'cr_values': cr_values, 'cb_values': cb_values}
In [15]:
def comp_stats(component):
    # Calculate the mean, variance, skewness, and kurtosis
    mean = np.mean(component)
    var = np.var(component)
    skewness = skew(component)
    kurt = kurtosis(component)
    return mean, var, skewness, kurt
In [ ]:
# Suppress runtime warnings raised by skew/kurtosis on (near-)constant data
warnings.filterwarnings("ignore", category=RuntimeWarning)

y_stats, cr_stats, cb_stats = [], [], []

# Add all Y, Cr, Cb stats to their respective lists
for i in range(500):
    y_stats.append(comp_stats(all_ycrcb_values[i]['y_values']))
    cr_stats.append(comp_stats(all_ycrcb_values[i]['cr_values']))
    cb_stats.append(comp_stats(all_ycrcb_values[i]['cb_values']))

Testing if point x = 125, y = 160 is within the contour in the test_image

In [17]:
test_contour = find_contours(test_image)[0]

if cv.pointPolygonTest(test_contour, (125, 160), measureDist=False) >= 0:
    print("Point (125, 160) is inside the contour")
else:
    print("Point (125, 160) is outside the contour")
Point (125, 160) is inside the contour

Mean values of Y, Cb and Cr components for the test image (within the contour)

In [18]:
# Test image Y, Cb, Cr means
test_ycrcb = cv.cvtColor(test_image, cv.COLOR_BGR2YCrCb)
test_y_values, test_cr_values, test_cb_values = pixels_in_contour(test_ycrcb, test_contour)
test_mean_y, test_mean_cr, test_mean_cb = np.mean(test_y_values), np.mean(test_cr_values), np.mean(test_cb_values)
print("Test image Y mean: ", test_mean_y, "\nTest image Cr mean: ", test_mean_cr, "\nTest image Cb mean: ", test_mean_cb)
Test image Y mean:  223.90895232815964 
Test image Cr mean:  126.90562638580931 
Test image Cb mean:  133.640243902439

In the second part, the images are converted to YCrCb format (OpenCV's ordering for this conversion mode).
They are then scanned pixel by pixel to collect the component values inside each image's contour.
These values are used to calculate the mean, variance, skewness and kurtosis of each YCrCb component for the sample set and the test image.
Main source: https://docs.opencv.org/3.4/d8/d01/group__imgproc__color__conversions.html#ga397ae87e1288a81d2363b61574eb8cab

Dimension features extraction.

In [19]:
# Function for ellipse fitting
def fit_ellipse(contour):
    # Fit an ellipse to the contour
    ellipse = cv.fitEllipse(contour)
    return ellipse
In [20]:
# Ellipses for all the images (the threshold value in find_contours had to be lowered, since some images had no detectable contours at the higher value)
ellipse_list = []
for indx, image in enumerate(image_list):
    contours = contours_list[indx]
    cnt = contours[0] # the first contour
    ellipse = cv.fitEllipse(cnt)
    ellipse_list.append(ellipse)
        
In [21]:
# The rice species in image_list are: 0-99 Arborio, 100-199 Basmati, 200-299 Ipsala, 300-399 Jasmine, 400-499 Karacadag
# We only need to plot the fitted ellipse for the first image of each species
fig, axs = plt.subplots(1, 5, figsize=(15, 4))
for idx, i in enumerate([0, 100, 200, 300, 400]):
    image_copy = image_list[i].copy()
    cv.ellipse(image_copy, ellipse_list[i], (0,255,0), 2)
    
    # Species name fetched from the directory name in the subset list
    axs[idx].imshow(cv.cvtColor(image_copy, cv.COLOR_BGR2RGB))  # BGR -> RGB for matplotlib
    axs[idx].set_title(os.path.basename(os.path.dirname(subset[i])))
    axs[idx].axis('off')
In [22]:
# Dimension features
MajorAxisLength = []
MinorAxisLength = []
AreaContour = []
PerimeterContour = []
EquivalentDiameter = []
Compactness = []
Shape_Factor_1 = []
Shape_Factor_2 = []
In [23]:
# Calculating the features for all of the images 
for indx, ellipse in enumerate(ellipse_list):
    L = max(ellipse[1]); MajorAxisLength.append(L)
    l = min(ellipse[1]); MinorAxisLength.append(l)
    A = cv.contourArea(contours_list[indx][0]); AreaContour.append(A)
    P = cv.arcLength(contours_list[indx][0], True); PerimeterContour.append(P)
    ED = np.sqrt((4*A) / (np.pi)); EquivalentDiameter.append(ED)
    Co = ED/L; Compactness.append(Co)
    SF1 = L/A; Shape_Factor_1.append(SF1)
    SF2 = l/A; Shape_Factor_2.append(SF2)
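The formulas above follow the original article: ED = sqrt(4A/π) is the diameter of a circle with the grain's area, Compactness = ED/L, SF1 = L/A and SF2 = l/A. A small sanity-check sketch on an ideal circle (hypothetical values, not taken from the data):

```python
import numpy as np

def dimension_features(L, l, A):
    # L = major axis length, l = minor axis length, A = contour area
    ED = np.sqrt(4 * A / np.pi)      # equivalent diameter
    return {'EquivalentDiameter': ED,
            'Compactness': ED / L,   # 1.0 for a perfect circle
            'Shape_Factor_1': L / A,
            'Shape_Factor_2': l / A}

# A circle of radius 10: both axes equal 20, area is pi * r^2
feats = dimension_features(L=20.0, l=20.0, A=np.pi * 100)
```

For a perfect circle the compactness is exactly 1, and elongated grains such as Basmati push it toward 0, which is why the feature separates the species so well later on.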
    
In [24]:
# Test image dimension features
test_ellipse = fit_ellipse(test_contour)
print("Test image features:")
print("Major axis length: ", max(test_ellipse[1]))
print("Minor axis length: ", min(test_ellipse[1]))
print("Contour area: ", cv.contourArea(test_contour))
print("Contour perimeter: ", cv.arcLength(test_contour, True))
test_ED = np.sqrt((4*cv.contourArea(test_contour)) / np.pi) # Used later for comparison
print("Equivalent diameter: ", test_ED)
print("Compactness: ", np.sqrt((4*cv.contourArea(test_contour)/np.pi))/max(test_ellipse[1]))
print("Shape factor 1: ", max(test_ellipse[1])/cv.contourArea(test_contour))
print("Shape factor 2: ", min(test_ellipse[1])/cv.contourArea(test_contour))
Test image features:
Major axis length:  194.51629638671875
Minor axis length:  50.02014923095703
Contour area:  7409.5
Contour perimeter:  430.8355675935745
Equivalent diameter:  97.1291326366
Compactness:  0.4993367365137288
Shape factor 1:  0.026252283742049902
Shape factor 2:  0.006750813041494977

The third part included ellipse fitting and plotting of each species. The ellipses were collected into a list, which was later used to calculate the features. The test image's dimension features are also calculated.
Main sources: the original article and
https://docs.opencv.org/3.4/d6/d6e/group__imgproc__draw.html#ga28b2267d35786f5f890ca167236cbc69

All features were gathered into a dataframe, with each row representing one sample and its associated feature values. Information about the original image and the rice species label were also included for each data point. This data was saved as a parquet file in the 'training_data' folder.

In [25]:
# For image label index
species_index =[]
for i in range(5):
    for j in range(100):
         species_index.append(i+1)
         
In [26]:
# Inserting the features into a dataframe
featuresDF = pd.DataFrame({
    'Label': [os.path.basename(os.path.dirname(species)) for species in subset],
    'Original_Image_Path': subset,
    'ImageIndex' : species_index,
    'MajorAxisLength': MajorAxisLength, 
    'MinorAxisLength': MinorAxisLength, 
    'AreaContour': AreaContour, 
    'PerimeterContour': PerimeterContour, 
    'EquivalentDiameter': EquivalentDiameter, 
    'Compactness': Compactness, 
    'Shape_Factor_1': Shape_Factor_1, 
    'Shape_Factor_2': Shape_Factor_2
    })
# Adding the Y, Cr, Cb statistics to the dataframe
featuresDF['Y_mean'] = [y[0] for y in y_stats]
featuresDF['Y_var'] = [y[1] for y in y_stats]
featuresDF['Y_skew'] = [y[2] for y in y_stats]
featuresDF['Y_kurt'] = [y[3] for y in y_stats]
featuresDF['Cr_mean'] = [cr[0] for cr in cr_stats]
featuresDF['Cr_var'] = [cr[1] for cr in cr_stats]
featuresDF['Cr_skew'] = [cr[2] for cr in cr_stats]
featuresDF['Cr_kurt'] = [cr[3] for cr in cr_stats]
featuresDF['Cb_mean'] = [cb[0] for cb in cb_stats]
featuresDF['Cb_var'] = [cb[1] for cb in cb_stats]
featuresDF['Cb_skew'] = [cb[2] for cb in cb_stats]
featuresDF['Cb_kurt'] = [cb[3] for cb in cb_stats]

# Saving the dataframe to /training_data as a parquet file
featuresDF.to_parquet('../training_data/Features.parquet')

Maximum variance of the Cr component for each rice species.

In [27]:
featuresDF.groupby('Label').Cr_var.max()
Out[27]:
Label
Arborio       1.473677
Basmati       7.572945
Ipsala        4.995914
Jasmine      14.479954
Karacadag     1.778424
Name: Cr_var, dtype: float64

Minimum equivalent diameter for each rice species

In [28]:
featuresDF.groupby('Label').EquivalentDiameter.min()
Out[28]:
Label
Arborio       72.273544
Basmati       72.800135
Ipsala       101.422395
Jasmine       69.854782
Karacadag     72.158943
Name: EquivalentDiameter, dtype: float64

Minimum, maximum and median equivalent diameter for Basmati rice samples.
This is compared to the equivalent diameter value of the test_image.

In [29]:
# Basmati equivalent diameter statistics compared to the test image
featuresDF_basmati = featuresDF.iloc[100:200]
print(featuresDF_basmati.EquivalentDiameter.describe())
print('Test image equivalent diameter: ', test_ED)
count    100.000000
mean      85.217352
std        5.050823
min       72.800135
25%       81.896929
50%       85.825367
75%       89.033770
max       94.713300
Name: EquivalentDiameter, dtype: float64
Test image equivalent diameter:  97.1291326366
In [30]:
# Plotting the images with the maximum and minimum compactness (BGR -> RGB for correct colors)
plt.subplot(1, 2, 1)
plt.imshow(cv.cvtColor(image_list[featuresDF[featuresDF.Compactness == featuresDF.Compactness.max()].index[0]], cv.COLOR_BGR2RGB))
plt.subplot(1, 2, 2)    
plt.imshow(cv.cvtColor(image_list[featuresDF[featuresDF.Compactness == featuresDF.Compactness.min()].index[0]], cv.COLOR_BGR2RGB))
Out[30]:
<matplotlib.image.AxesImage at 0x14ac4363b50>

To assess the effect of the compactness value on the shape of the rice, we can plot the images with the maximum and minimum compactness values, as done in the code above.
The pictures show that the higher the compactness value, the rounder the grain.

The last part gathered all of the features into a dataframe (featuresDF).
For each species, these features were used to determine the maximum variance of the Cr component and the minimum equivalent diameter. The minimum, maximum and median ED were also calculated for the Basmati samples. Lastly, the effect of the compactness value was evaluated.
Main source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

Data exploration¶

Import training data

In [4]:
df = pd.read_parquet('../training_data/Features.parquet')

Standardization using z-score

In [5]:
# Extract numeric data and standardize
scaler = StandardScaler()
num_data = df.drop(df.columns[[0, 1, 2]], axis=1)
num_data_z = pd.DataFrame(scaler.fit_transform(num_data), columns=num_data.columns)
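StandardScaler computes the z-score z = (x − μ)/σ per column, with σ the population standard deviation (ddof = 0). A small equivalence sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
X_z = StandardScaler().fit_transform(X)

# Manual z-score using the population standard deviation (ddof=0), matching the scaler
manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

After the transform, every feature has mean 0 and standard deviation 1, which puts the color statistics and the pixel-scale dimension features on a comparable footing for PCA and the SVM.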

Histograms

In [6]:
# Labels
labels = num_data_z.columns

#Species
species = df['Label'].unique()

# Colors
colors = ['blue', 'green', 'red', 'cyan', 'magenta']
In [35]:
num_rows = 5
num_cols = 4

# Create the figure and axes
fig, axes = plt.subplots(num_rows, num_cols, figsize=(30, 30))

# Flatten the axes to iterate over them
axes = axes.flatten()

# Iterate over each feature and plot it
for i, feature in enumerate(num_data_z.columns):
    for j in range(5):  # Iterate over each segment of the data
        ax = axes[i]
        ax.hist(num_data_z.iloc[j * 100: (j + 1) * 100][feature], color=colors[j], label=species[j], edgecolor='black', bins=15)
        ax.set_title(feature)
        ax.set_xlabel('Value')
        ax.set_ylabel('Count')
        ax.legend()
plt.tight_layout()
plt.show()
In [7]:
# Replace NaN values (produced by skew/kurtosis on near-constant channels) with 0
num_data_z.fillna(0, inplace=True)

Pairplots

In [37]:
# Add image index to copy for hue
num_data_z_ind = num_data_z.copy()
num_data_z_ind['ImageIndex'] = df['ImageIndex']

# Pairplot
sns.pairplot(num_data_z_ind, hue='ImageIndex')
Out[37]:
<seaborn.axisgrid.PairGrid at 0x14ae9045b50>

PCA

In [8]:
pca = PCA(n_components=2)
pca_data = pca.fit_transform(num_data_z)

pca_data_w_labels = pd.DataFrame(data=pca_data, columns=['PC1','PC2'])
pca_data_w_labels['species'] = df['Label']

# Plot PCA with scatterplot
sns.scatterplot(x=pca_data_w_labels['PC1'], y=pca_data_w_labels['PC2'], hue=pca_data_w_labels['species'])

# Calculate the cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
print(cumulative_variance_ratio)
[0.29624181 0.50040298]
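Since the first two components explain only about 50% of the variance, it can help to inspect the component loadings to see which original features drive each axis. A sketch on synthetic data (`demo` here is a hypothetical stand-in for the standardized feature frame, not the rice data):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
demo['d'] = 2 * demo['a'] + rng.normal(scale=0.1, size=200)  # 'd' strongly tied to 'a'

pca = PCA(n_components=2).fit(demo)
# Rows = original features, columns = principal components
loadings = pd.DataFrame(pca.components_.T, index=demo.columns, columns=['PC1', 'PC2'])
```

Features with large absolute loadings on PC1 dominate the horizontal spread of the scatterplot, so this table links clusters in the PCA figure back to concrete features such as compactness.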

Analysis¶

Features with discriminative power:

The compactness plots show that, for each species, the values mostly do not overlap. This suggests that the compactness feature has discriminative and predictive power with regard to the target variable. Shape Factors 1 and 2 and MajorAxisLength also have some discriminative power, as their bins do not fully overlap and they spread evenly across the z-score range, but the compactness feature has the clearest separation.

Correlations:

According to the pairplots, the equivalent diameter and contour area features have the highest correlation, as their plots show clear upward-sloping lines with minimal spread. Shape Factors 1 and 2 correlate negatively with the major and minor axis lengths, meaning that as the shape factor values rise, the axis length values decrease and vice versa.

Clusters:

According to the PCA figure, Ipsala forms a clear cluster. Basmati has some points that fall among Jasmine's outliers, but overall they are distinguishable. Karacadag's and Arborio's values are close to each other but still distinguishable, and the Karacadag values have minimal spread. Overall, as the data points are well spread out and distinguishable, the PCA reduction appears to have successfully captured the structure of the data, and the classification of the rice species should be achievable.

Model selection¶

In [9]:
# Separate features (X) and target variable (y)
X = num_data_z
y = df['ImageIndex']
In [10]:
from sklearn.exceptions import ConvergenceWarning

def classify(param, clf):
  # Convergence warnings from partially converged candidates are not informative here
  warnings.filterwarnings("ignore", category=ConvergenceWarning)

  kf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
  # GridSearch CV
  grid = GridSearchCV(clf, param, cv=kf)
  grid.fit(X, y)

  # Report the selected combination of hyperparameters
  print("Best Parameters:", grid.best_params_)

  # Retrieve the cross-validated accuracy for each hyperparameter combination
  cv_accuracy = grid.cv_results_['mean_test_score']

  # Print the accuracy for each hyperparameter combination
  for idx, accuracy in enumerate(cv_accuracy):
      print('Accuracy:', accuracy, grid.cv_results_['params'][idx])

  return grid.best_estimator_
In [41]:
# Random forest model
rfparam = {
    'n_estimators': np.arange(100, 301, 50),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}
# Classifier
rfc = RandomForestClassifier(random_state=1)

best_est = classify(rfparam, rfc)

# The best estimator is already refit on the full data by GridSearchCV (refit=True)
best_model = best_est

# Feature Importance
print("Feature Importance:")
feature_importance = best_model.feature_importances_
for i, importance in enumerate(feature_importance):
    print(f"Feature {i+1}: {importance}")
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
Accuracy: 0.9933333333333334 {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}

Feature Importance:
Feature 1: 0.08664155916127662
Feature 2: 0.06451432389126312
Feature 3: 0.05443836368559459
Feature 4: 0.04359869029569855
Feature 5: 0.06922875376502739
Feature 6: 0.1938211532258012
Feature 7: 0.09433086926980988
Feature 8: 0.10941897875372361
Feature 9: 0.005806439262458837
Feature 10: 0.02851834945145425
Feature 11: 0.003901500434387335
Feature 12: 0.016979800547849987
Feature 13: 0.015448983162924928
Feature 14: 0.007668586292151838
Feature 15: 0.0029061319429585547
Feature 16: 0.0027766526312003933
Feature 17: 0.09211303445469109
Feature 18: 0.08183896469667078
Feature 19: 0.020887301386461006
Feature 20: 0.005161563688596092
In [42]:
# Support Vector Machine
svmparam = {
    'gamma': ['scale', 'auto'],
    'C' : [0.1, 1, 10, 100],
    'kernel' : ['linear', 'rbf', 'poly']
}
# Classifier
svmf = svm.SVC(random_state=1)

classify(svmparam, svmf)
Best Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Accuracy: 0.9933333333333334 {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}

Out[42]:
SVC(C=0.1, kernel='linear', random_state=1)
In [11]:
# MLP
mlpparam = {
  'hidden_layer_sizes': [(n,) for n in range(15, 41, 5)],  # Number of neurons in the hidden layer
  'activation': ['tanh', 'relu'],  # Activation functions
  'solver': ['sgd', 'adam'],  # Solvers
  'alpha': [0.01, 0.1, 1],  # L2 regularization strength
  'validation_fraction': [0.1, 0.3]  # Validation fraction
}
# Classifier
mlpc = MLPClassifier(random_state=1)
classify(mlpparam, mlpc)
Best Parameters: {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': (25,), 'solver': 'adam', 'validation_fraction': 0.1}
Accuracy: 0.9900000000000001 {'activation': 'relu', 'alpha': 0.01, 'hidden_layer_sizes': (25,), 'solver': 'adam', 'validation_fraction': 0.1}

Out[11]:
MLPClassifier(alpha=0.1, hidden_layer_sizes=(25,), random_state=1)

Analysis¶

For model selection, accuracy was used to measure performance, which only captures the proportion of correct predictions out of all predictions. Besides cross-validating and choosing the best hyperparameters based on accuracy, other evaluation metrics could be used, such as the F1-score, which combines precision and recall (the proportion of true positives among all positive predictions and among all actual positive instances, respectively). These would capture the model's performance more fully. The area under the ROC curve (AUC) can also be plotted and used to estimate performance.

The model selection process could also use different cross-validation techniques, such as stratified k-fold, which ensures that each fold has the same proportion of classes and is therefore effective on imbalanced datasets where certain classes may dominate the set. Leave-one-out cross-validation could also be considered, as it has no repeated iterations, whereas the repeated k-fold technique used here was computationally heavy.
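As a sketch of these alternatives, scikit-learn's `scoring` parameter lets the same cross-validation report macro-averaged F1 instead of accuracy, and StratifiedKFold keeps the class proportions per fold (illustrated here on synthetic data, not the rice features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic 5-class problem standing in for the rice data
X_demo, y_demo = make_classification(n_samples=250, n_features=20, n_informative=6,
                                     n_classes=5, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# scoring='f1_macro' averages the per-class F1 scores, treating all classes equally
f1_scores = cross_val_score(SVC(), X_demo, y_demo, cv=skf, scoring='f1_macro')
```

With the balanced 100-samples-per-species design here, accuracy and macro F1 track each other closely, but the F1 variant would matter if some species were underrepresented.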

Performance estimation¶

Performance estimation using 5-fold repeated cross-validation with 3 repetitions and the same parameter ranges as in Part 2 for the inner loop. The random forest nested cross-validation was run beforehand, as it is computationally very heavy.

For the outer loop we use 10-fold stratified k-fold cross-validation.

In [14]:
# Random forest hyperparameter tuning and training results (precomputed)
results = pd.read_parquet('../training_data/rf_results.parquet')

Distribution of the selected parameter combinations, mean accuracy per fold, and the confusion matrix.

In [15]:
print(results.best_params.value_counts())

fold_accuracy = results.groupby('fold')['accuracy'].mean()
full_mean = results.accuracy.mean()
print(fold_accuracy)
print(full_mean)

true_class = []

# Reconstruct the true class from the per-sample accuracy: when accuracy == 1 the
# prediction equals the true class; mismatches are collected under "else"
for acc, pred in zip(results.accuracy, results.pred_class):
    if acc == 1:
        true_class.append(pred)
    else:
        true_class.append("else")

print(set(true_class))
conf_matrix = metrics.confusion_matrix(true_class, results.pred_class, labels=['Kar', 'Arb', 'Ips', 'Jas', 'Bas'])
print(conf_matrix)
best_params
{'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 100}    200
{'bootstrap': True, 'max_features': 'sqrt', 'n_estimators': 150}     100
{'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 150}     50
{'bootstrap': True, 'max_features': 'sqrt', 'n_estimators': 300}      50
{'bootstrap': True, 'max_features': 'sqrt', 'n_estimators': 100}      50
{'bootstrap': True, 'max_features': 'sqrt', 'n_estimators': 250}      50
Name: count, dtype: int64
fold
0    0.98
1    1.00
2    1.00
3    1.00
4    1.00
5    1.00
6    0.96
7    1.00
8    1.00
9    1.00
Name: accuracy, dtype: float64
0.994
{'Kar', 'Arb', 'else', 'Jas', 'Ips', 'Bas'}
[[ 99   0   0   0   0]
 [  0  99   0   0   0]
 [  0   0  99   0   0]
 [  0   0   0 100   0]
 [  0   0   0   0 100]]

Nested cross-validation using cross_val_predict function from scikit-learn

In [16]:
feats_Z = ['mean_y_Z', 'var_y_Z', 'skew_y_Z', 'kurt_y_Z',
           'mean_cr_Z', 'var_cr_Z', 'skew_cr_Z', 'kurt_cr_Z',
           'mean_cb_Z', 'var_cb_Z', 'skew_cb_Z', 'kurt_cb_Z',
           'major_axis_length_Z', 'minor_axis_length_Z',
           'area_Z', 'perimeter_Z', 'equivalent_diameter_Z',
           'compactness_Z', 'shape_factor1_Z', 'shape_factor2_Z']
In [ ]:
y = results['class']
X = results[feats_Z]

rf = RandomForestClassifier(random_state=20)
mlp = MLPClassifier(random_state=20)
svc = SVC(random_state=20)  # named svc to avoid shadowing the imported svm module

kf_outer = StratifiedKFold(n_splits=10, random_state=10, shuffle=True)
kf_inner = RepeatedKFold(n_splits=5, n_repeats=3, random_state=5)

rf_parameters = {
    'n_estimators': range(100, 350, 50),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}
mlp_parameters = {
    'hidden_layer_sizes': [(n,) for n in range(15, 41, 5)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.01, 0.1, 1],
    'validation_fraction': [0.1, 0.3]
}
svm_parameters = {
    'gamma': ['scale', 'auto'],
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly']
}
gscv_rf = GridSearchCV(rf, rf_parameters, cv=kf_inner, return_train_score=False, n_jobs=-1)
gscv_mlp = GridSearchCV(mlp, mlp_parameters, cv=kf_inner, return_train_score=False, n_jobs=-1)
gscv_svm = GridSearchCV(svc, svm_parameters, cv=kf_inner, return_train_score=False, n_jobs=-1)

# Nested cross-validation and predictions
nested_pred_rf = cross_val_predict(gscv_rf, X, y, cv=kf_outer)
nested_pred_mlp = cross_val_predict(gscv_mlp, X, y, cv=kf_outer)
nested_pred_svm = cross_val_predict(gscv_svm, X, y, cv=kf_outer)

# Accuracy
accuracy_rf = metrics.accuracy_score(y, nested_pred_rf)
accuracy_mlp = metrics.accuracy_score(y, nested_pred_mlp)
accuracy_svm = metrics.accuracy_score(y, nested_pred_svm)

# Confusion matrix
conf_matrix_rf = metrics.confusion_matrix(y, nested_pred_rf)
conf_matrix_mlp = metrics.confusion_matrix(y, nested_pred_mlp)
conf_matrix_svm = metrics.confusion_matrix(y, nested_pred_svm)

print("Random Forest Accuracy:", accuracy_rf)
print("Random Forest Confusion Matrix:")
print(conf_matrix_rf)

print("\nMLP Accuracy:", accuracy_mlp)
print("MLP Confusion Matrix:")
print(conf_matrix_mlp)

print("\nSVM Accuracy:", accuracy_svm)
print("SVM Confusion Matrix:")
print(conf_matrix_svm)
Random Forest Accuracy: 0.994
Random Forest Confusion Matrix:
[[ 99   0   0   0   1]
 [  0 100   0   0   0]
 [  0   0  99   1   0]
 [  0   0   0 100   0]
 [  1   0   0   0  99]]

MLP Accuracy: 0.988
MLP Confusion Matrix:
[[ 97   0   0   3   0]
 [  0  99   0   1   0]
 [  0   0 100   0   0]
 [  1   1   0  98   0]
 [  0   0   0   0 100]]

SVM Accuracy: 0.998
SVM Confusion Matrix:
[[ 99   0   0   1   0]
 [  0 100   0   0   0]
 [  0   0 100   0   0]
 [  0   0   0 100   0]
 [  0   0   0   0 100]]

Results¶

The results differed somewhat from those of the original article: the highest accuracy there was reported for the Multi-Layer Perceptron at 0.9991, while in our case the MLP scored 0.988. The Support Vector Machine (using all features) had a similar accuracy score (0.998), and Random Forest scored lower in our project (0.994 vs. 0.998). The closeness of the results is still surprising, as the original dataset contained 75,000 images, whereas our project had only 500 samples and far fewer features than the original study. The model could be useful on rice plantations but would need to be exposed to more species, samples and imaging conditions.

Throughout the project, the small sample size led me to believe, mistakenly, that prediction accuracy would be poor. With the final accuracies from nested cross-validation using stratified and repeated k-folds, largely unbiased hyperparameter selection and strong accuracy were achieved even with a small dataset, and I was surprised by the precision.

Throughout the project, I learned to inspect, craft and review my code before running it, as the runtimes of the code cells grew dramatically over time. This was a learning experience: in the beginning I brute-forced multiple commands, since their correctness could be checked easily, but later this became an exhausting approach as runtimes grew to up to 30 minutes per cell.

I believe the visualization parts succeeded well, as I aimed to make them as informative as possible so that I could also understand the data thoroughly. The methods and data processing techniques used during the project will certainly be useful in the future.